Systems and methods for scalable structured data distribution

ABSTRACT

Systems and methods for efficiently absorbing, archiving, and distributing any size data sets are provided. Some embodiments provide flexible, policy-based distribution of high volume data through real time streaming as well as past data replay. In addition, some embodiments provide for a foundation of solid and unambiguous consistency across any vendor system through advanced version features. This consistency is particularly valuable to the financial industry, but also extremely useful to any company that manages multiple data distribution points for improved and reliable data availability.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application Ser. No. 61/623,877, filed Apr. 13, 2012, which is incorporated herein in its entirety by reference for all purposes.

TECHNICAL FIELD

Various embodiments of the technology of the present application generally relate to data delivery. More specifically, some embodiments of the technology of the present application relate to systems and methods for scalable structured data distribution.

BACKGROUND

There is growing regulatory and competitive pressure on various industries to improve the quality, consistency, and availability of reported data. Storage and processing demands are increasing along multiple dimensions such as granularity, online history, redundancy, and collections for joining together new combinations of data. In addition, intra-day versioning is becoming necessary for managing discrepancies between departments with different timing needs as data is increasingly shared across departments within a company. Departments also are starting to look for the road that will take them from batch processing to incremental real-time and stream data management.

While demand for efficient and consistent data management is growing, many large companies are replacing failing ACID (Atomicity, Consistency, Isolation, and Durability) architectures with scalable BASE (Basically Available, Soft state, Eventually consistent) architectures. Solutions to view and analyze large to huge data sets are becoming commonplace as these companies release aspects of their cloud-scaling systems to open source. While hyper-scale analysis engines are becoming commonplace, tools to manage the movement of data sets have not kept pace. Large companies are scrambling to protect themselves from the growing likelihood of outages because they lack the means to manage the availability of large data streams.

Many other companies face the same inability to replicate growing data sets. ACID architectures are costly, complex, and ill-suited for ensuring that data is consistent and available across space and time (e.g., department data sharing and forensics). A higher bar for availability, consistency, and governance of these growing data sets is continually being set.

SUMMARY

Systems and methods are described for scalable structured data distribution. In some embodiments, a method can include receiving streaming raw data from a data producer. The data can be bundled into packages of data (i.e., bundles) based on an archiving strategy. In some cases, any metadata associated with the streaming data is leveraged for efficient policy-driven routing. The metadata can be published, possibly recursively, on one or more channels (e.g., a control channel). Each of the packages of data may be ordered using a series of consecutive integers produced by a master clock. The packages of data can then be archived and delivered (e.g., in parallel) to consumers that have subscribed to the data producer. The packages of data can be replayed, based on the ordering identified by the consecutive integers, upon a request from a data consumer.

Embodiments of the technology of the present application also include computer-readable storage media containing sets of instructions to cause one or more processors to perform the methods, variations of the methods, and other operations described herein.

Some embodiments include a system comprising a bundler, a transformer, a stream clock, and an archiving service. The bundler can be configured to receive streaming raw data from a data producer and bundle the data into a series of data packages by associating each of the data packages with a unique identifier having a monotonically increasing order. The transformer can receive the data packages (e.g., from an archive) and generate loadable data structures for a reporting store associated with a data subscriber. A loader can receive and store the loadable data structures into a storage device associated with the data subscriber based on the logical ordering.

Some embodiments can include a master clock configured to generate a logical series of integers, each of which is associated with a single data package in the business-aligned, policy-driven (declarative) series of data packages. In various embodiments, the system can include a data channel allowing data from a data producer to be continuously streamed to the data subscriber. In addition, a messaging channel can be used to provide a current status of the data being continuously streamed from the data producer to the data subscriber through the data distribution system. A control channel, separate from the data channel, that allows the data subscriber to request replay of the data may also be used in some embodiments.

While multiple embodiments are disclosed, still other embodiments of the technology of the present application will become apparent to those skilled in the art from the following detailed description, which shows and describes illustrative embodiments of the technology. As will be realized, the technology is capable of modifications in various aspects, all without departing from the scope of the present technology. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present technology will be described and explained through the use of the accompanying drawings, in which:

FIG. 1 illustrates an example of an environment in which some embodiments of the present technology may be utilized;

FIG. 2 illustrates phases of operation of a data distribution system in accordance with one or more embodiments of the present technology;

FIG. 3 is a flowchart illustrating a set of operations for bundling data in accordance with various embodiments of the present technology;

FIG. 4 is a flowchart illustrating a set of operations for processing data streams in accordance with some embodiments of the present technology;

FIG. 5 illustrates a set of components of a data distribution system in accordance with one or more embodiments of the present technology;

FIG. 6 is a flowchart illustrating a set of operations for delivering data in accordance with various embodiments of the present technology;

FIG. 7 illustrates an overview of a data distribution system architecture which can be used in one or more embodiments of the present technology; and

FIG. 8 illustrates an example of a computer system with which some embodiments of the present technology may be used.

The drawings have not necessarily been drawn to scale. For example, some components and/or operations may be separated into different blocks or combined into a single block for the purposes of discussion of some of the embodiments of the technology of the present application. Moreover, while the technology is amenable to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the scope of the application to the particular embodiments described. On the contrary, the application is intended to cover all modifications, equivalents, and alternatives falling within the scope of the technology as defined by the appended claims.

DETAILED DESCRIPTION

Various embodiments of the technology of the present application generally relate to data management (e.g., the storage and movement of big data). More specifically, some embodiments relate to systems and methods for scalable structured data distribution. Some embodiments provide for a data bus suitable for reliably distributing large volumes of data to multiple clients in parallel. In addition, some embodiments include an integrated system for efficiently absorbing, archiving, and distributing any size data sets as well as providing flexible, policy-based distribution of high-volume data through real-time streaming as well as past data replay.

Data consumers often desire data to be consistent, available, and partitioned (“CAP”). Achieving all of these attributes instantaneously is often difficult. Delayed consistency is the favorable compromise to make in many institutions where consistency and availability are crucial but some timing delay is acceptable. As such, some embodiments of the data distribution system disclosed herein hold consistency, availability, and partitioning sacred, while giving ground only on the timing of consistency. Through a unique clocking scheme used to tag data, various embodiments of the data distribution system achieve the required CAP, eventually.

In order to address scale-out requirements stemming from regulatory and competitive pressures, various embodiments provide for a data flow solution leveraging the BASE architecture. Various embodiments provide for an integrated system for efficiently absorbing, archiving, and distributing any size data sets. The integrated system can provide for scalable distribution (i.e., efficient, simultaneous streaming to any number of consumers), consistency (i.e., consistent live backup, data sharing, and forensics support), agility (i.e., vendor independence and rapid adoption of analysis engines), governance (i.e., secure policy-driven management over distribution), and/or forensics (i.e., replay and restore of past versions of data at high speed).

In addition, some embodiments of the integrated data distribution system allow developers to identify data in simple terms (schema and business purpose) and submit high volumes of data into a bus where policies govern storage, transformation, and streaming into multiple targets simultaneously. All versions of data sent through the bus can be compressed and stored and then replayed seconds, days, or years later with guaranteed consistency into any target. Some embodiments include features that can be applied more broadly, including an adaptable component-based, message-driven architecture, stateless and declarative setup, and special compression and optimization features.

Some embodiments bridge data flows with cost-saving cloud technologies across internal and external domains. To this end, some embodiments of the technology can include features to support a variety of distributed/non-distributed flow combinations, such as, but not limited to, the following: 1) high volume data capture and replay; 2) policy-based encryption for security on the wire and on disk; 3) high-throughput parallel transport; 4) en-route parallel processing or transformation; 5) superior structured compression; 6) ability to monitor and manage data processing and storage costs at a business level; and/or 7) flexible adapters into and out of repeatable data flows.

While many traditional systems use an imperative data flow, various embodiments of the present technology use rule-based or “declarative” data flow. As a result, provision, subscription, channeling, archiving, and entitlement of all data may be decoupled from any proprietary implementation through a set of well-understood rules. In addition, data can be abstracted from the underlying repository models. Because captured data is kept in ordered, raw form, the data can be replayed from any point in the past into new database solutions, providing an excellent platform for fast adoption of new technologies. Built-in consistency mechanisms help manage simultaneous flows in some embodiments. This allows different data stores to be populated and used in parallel. By using consistent stores in parallel, a new hybrid reporting architecture becomes possible. For example, the combined advantage of tandem relational and NoSQL engines can be made available to the application layer in ways that give new performance and cost-scaling dynamics for large data reporting challenges.

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the technology. It will be apparent, however, to one skilled in the art that embodiments of the technology may be practiced without some of these specific details.

Moreover, the techniques introduced here can be embodied as special-purpose hardware (e.g., circuitry), as programmable circuitry appropriately programmed with software and/or firmware, or as a combination of special-purpose and programmable circuitry. Hence, embodiments may include a non-transitory machine-readable medium having stored thereon instructions that may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, ROMs, random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), application-specific integrated circuits (ASICs), magnetic or optical cards, flash memory, or other types of media/machine-readable media suitable for storing electronic instructions.

Terminology

Brief definitions of terms, abbreviations, and phrases used throughout this application are given below.

The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct physical connection or coupling. Thus, for example, two devices may be coupled directly, or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed therebetween, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.

The phrases “in some embodiments,” “according to some embodiments,” “in the embodiments shown,” “in other embodiments,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one implementation of the present technology, and may be included in more than one implementation. In addition, such phrases do not necessarily refer to the same embodiments or different embodiments.

If the specification states a component or feature “may”, “can”, “could”, or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.

General Description

FIG. 1 illustrates an example of an environment 100 in which some embodiments of the present technology may be utilized. As illustrated in FIG. 1, data producers 110A-110N produce data that is distributed over data distribution network 120 to data consumers 130A-130N. In accordance with various embodiments, data distribution network 120 achieves scalable data flow through a combination of features. Data generated by data producers 110A-110N can be structured, and flow processing components can all be given awareness of this structure. This awareness can be injected at runtime and can even be injected per work cycle, if needed, for more dynamic resource sharing. By structuring the data generated by the data producers in a systematic manner, various embodiments can take advantage of one or more of the following features: 1) efficient in-process and on-disk columnar compression; 2) separation of data and control flow; 3) transparency and monitoring of flow by surfacing important business constructs; 4) ability to efficiently manage streams through subscriptions, filtering, forking, or merging; and 5) ability to implant structured data services into the flow, such as key-generation services upstream of target data stores.

In some embodiments, data distribution network 120 includes flow processing components that push all products to external services for state management and recovery on a per-work-cycle basis. Services that save down state can be specialized for that purpose. In this way, all the heavy processing components of the system recover off of external service state and can be made completely dynamic from cycle to cycle. The stateless rule of processing is to read progress from the downstream product through a designated stateful service that specializes in managing state (no processing). Similarly, flow processing components may only receive the input pushed to them or request it from a declared upstream service.

One way to visualize embodiments of the design is to think of flow processing components as segments of pipe that can be dynamically attached to compatible upstream and downstream segments of pipe. Pipe connectivity is highly dynamic in that more processing can be spun up between stateful components without the need to upload or restore any previous state information. In various embodiments, there is only one stateful service component from which all other stateful services can eventually recover. This is the raw data archive service 140. Archiving requirements of all downstream state services can be relaxed in some embodiments.

FIG. 2 illustrates phases of operation of a data distribution system 200 in accordance with one or more embodiments of the technology. As illustrated in FIG. 2, data distribution system 200 is able to ingest streaming data from one or more data producers 110A-110N. As explained further below, the data is then bundled into discrete data packages that can be compressed (e.g., using columnar compression). These discrete data packages are then distributed to one or more data consumers. Substantially simultaneously, the discrete data packages may be provided a unique identifier and archived (not shown in FIG. 2). The data consumers can then unpack the data from the discrete data packages and generate reports or otherwise use the data.

Data distribution system 200 may have a pluggable component architecture. The data distribution system can have processing components and stateful services. In some embodiments, the processing components can include two-channel pipe segments—one channel for data flow and one channel for control flow. The stateful services may also have separate data and control channels, but serve as demultiplexors for multiple destination fan-out, as illustrated in FIG. 2. The components can be language and platform independent. Contracts between components can be implemented through messaging subscription/publication (which includes instrumentation output), through narrow delivery APIs of a store/fetch nature, and through process control APIs for startup/shutdown.

External components behave very similarly from an operational standpoint. They publish consistent instrumentation information aligned with the distributed master clock so that progress and capacity can be understood through the system. Instrumentation can be aligned with the unique identifiers of the data packages (i.e., bundle IDs). That is, there are instrumentation events at the beginning of consuming a bundle and upon completion of processing a bundle. This aspect of instrumentation lines up the instrumentation events with the master clock. Lining up instrumentation with master clock events allows the instrumentation aligned with the stream clock events to act as control events.

Components of data distribution system 200 extend beyond basic transport and can be used in any step towards end delivery of data. Some embodiments provide for a data-curation process that shapes data into a format accessible in the future. If the data needs to be recovered or replayed, the archive can locate and retransmit an exact replica of the data based on the unique identifiers. Such a curation process, for example, would also follow the dual-pipe stateless process model with consistent instrumentation. In this way, capacity and monitoring can be managed with a single toolset all the way through the flow.

Data distribution system 200 can maximize event efficiency because durability requirements are not essential downstream from the archive. Durable messages are only required coming into a bundler for packaging and archiving. From that point on, all data is repeatable in view of the unique identifier and packaging. Non-durable topics are used to publish all activity out of and downstream of the bundler, greatly reducing the infrastructural burden.

By making the component configuration of the data distribution system 200 declarative, the system can dynamically spin up flow processing components, as explained above. By making data structure and data source models declarative, the system provides operational transparency on those sources and a simple means to build a public catalog of what data is available from where. Declarative subscriptions can be used to precisely understand the data flows shared between departments using data distribution system 200. All data and control signals may be pushed through the system, thus allowing a declarative contract and decoupling between publishers and subscribers of data along with dynamic setup and scaling.

FIG. 3 is a flowchart illustrating a set of operations 300 for bundling data in accordance with various embodiments of the present technology. Receiving operation 310 receives streaming data from one or more data producers. Determination operation 320 determines bundling parameters (e.g., bundle size), and the disparate small pieces of the streaming data are bundled in accordance with those parameters. The bundling parameters can include business rules that allow a business to determine how the data should be grouped or bundled (e.g., by a bundler as illustrated in FIG. 5) into the data packages (e.g., based on content, source, expected use, etc.). In some cases, the bundler can aggregate raw messages or data before any transformation is performed. The bundles can be assigned a monotonically increasing bundle ID or other unique identifier to sequentially order the bundles. This unique ID may be a public ID that is leveraged in reporting for versions, verification, and any other query activity that needs to leverage the guaranteed order of bundle stream data.
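By way of illustration only, the bundling step might be sketched in Python as follows; the class and parameter names (e.g., StreamBundler, max_rows) are hypothetical, and the real bundling parameters are policy driven as described above:

    import itertools

    class StreamBundler:
        """Groups incoming records into bundles and assigns gapless,
        monotonically increasing bundle IDs (illustrative only)."""

        def __init__(self, max_rows=10000, next_id=1):
            self.max_rows = max_rows            # bundling parameter, e.g. row-count threshold
            self._ids = itertools.count(next_id)
            self._pending = []

        def submit(self, record):
            """Accept one raw record; emit a bundle when the threshold is met."""
            self._pending.append(record)
            if len(self._pending) >= self.max_rows:
                return self.flush()
            return None

        def flush(self):
            """Package pending records as a bundle tagged with the next ID."""
            if not self._pending:
                return None
            bundle = {"bundle_id": next(self._ids), "rows": self._pending}
            self._pending = []
            return bundle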

Storing operation 330 stores the bundled data packages in an archive. Publishing operation 340 can then publish metadata associated with the bundles on a control channel. In some embodiments, the metadata can include bundle IDs. In addition to publishing the metadata on the control channel, publishing operation 350 may publish the metadata to a separate stream. While the control information can be sent out on a non-durable topic, an identical copy of the control data may be available with the data itself. Some advantages of including the control data with the packages include the following: 1) the exact behavior of the system can be read and replayed from the archive, as the package contains all events and data; and 2) control summary information, including index information located in the bundle itself, allows for fast scanning of the files (just the index), eliminating the need to scan through the entire file in many cases.

FIG. 4 is a flowchart illustrating a set of operations 400 for processing data streams in accordance with some embodiments of the technology. One or more of these operations can be performed by various system components such as a submission service or a bundler. As illustrated in FIG. 4, receiving operation 410 receives streaming data from one or more data producers. Bundling operation 420 bundles the streaming data into data packages, as illustrated in FIG. 3, that are assigned a unique identifier by ordering operation 430. Delivery operation 440 delivers the data packages with the assigned unique identifier to one or more data consumers.

Bundling operation 420 can aggregate disparate small pieces of information that enter the data distribution system into larger logically-clocked blocks of data, data packages, or bundles. The aggregation of large data flows performed by bundling operation 420 allows the system to leverage the separation of data and control. By tuning bundle size and controlling the information that is included in the bundle's metadata, the system can create many-fold efficiency for managing data by the bundle's metadata (a.k.a. control data). Various embodiments allow for the selection of the content of the control data in order to tune decisions for large data flows.

Aggregation or “chunking” of the data by bundling operation 420 also has the direct and simple benefit of improving IO performance. In some embodiments, bundling is closely related to serialization and identifying the order of the bundles by a unique ID, which can be useful in creating reliable distribution of data. Data flow cannot be reliably reproduced in multiple locations (e.g., primary and backup) without the consistent ordering given by the serialization and identification.

FIG. 5 illustrates a set of components of a data distribution system 500 in accordance with one or more embodiments of the technology. As illustrated in FIG. 5, data distribution system 500 can include bundlers 510, archive 520, transformers 530, repository services 540, loaders 550, reporting stores 560, and storage 570.

From a data flow perspective, bundlers 510 may sit between a submission service and archive service 520. The submission service, while not shown, receives the raw streaming data from one or more data producers and feeds the raw data into the bundlers 510. Bundlers 510 create serially ordered bundles for specific data streams—that is, there is a specific set of coordinated bundler processes for each type of message flow. These coordinated processes create an aggregated data representation and send it to the archive service.

Given that bundles are generated and flowed in monotonically increasing order, different “bundle streams” may be created to facilitate dynamic acceleration of different parts of a data flow, where some bundles are made to flow faster than others based on some selection criteria. A selection of metadata can be used to identify and separately enumerate a series of bundles.

As an example, multiple bundles can be constructed by the following criteria: Business Date, Type := {Greeks | CreditGreeks | UnderlierAttributes}, and Region := {GL | AS | EU | AM}. This will result in twelve bundle streams flowing through the data distribution system per business date; the unique identifiers for each bundle stream must be separate to allow for prioritization, and must remain gapless and monotonically increasing.
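For illustration, the twelve streams of the example above could be enumerated as independent keys, each carrying its own gapless counter; the sketch below is hypothetical and not part of any particular implementation:

    from itertools import count, product

    TYPES = ["Greeks", "CreditGreeks", "UnderlierAttributes"]
    REGIONS = ["GL", "AS", "EU", "AM"]

    def stream_counters(business_date):
        """One independent, gapless bundle counter per (date, type, region) stream."""
        return {
            (business_date, t, r): count(1)
            for t, r in product(TYPES, REGIONS)   # 3 x 4 = 12 streams per business date
        }

    counters = stream_counters("2012-04-13")
    first_id = next(counters[("2012-04-13", "Greeks", "EU")])   # -> 1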

In some embodiments, the bundle contents can include control data that is packaged together with data into bundles. While the control information also is sent out on a non-durable topic, an identical copy of it is available with the data itself. Some advantages include the following: 1) the exact behavior of the system can be read and replayed from the archive, since the bundles contain all events and data; and 2) control summary information, including index information located in the bundle itself, allows for fast scanning of the files (just the index), thereby reducing or eliminating the need to scan through the entire file.

A bundle can be an aggregate of raw messages before any transformation is performed. Bundles may be serialized via the unique identifier, which may be a monotonically increasing bundle ID. This unique identifier may be a public ID that is leveraged in reporting for versions, verification, and any other query activity that needs to leverage the guaranteed order of bundle stream data. A bundle message may include any of the following five sections: 1) a summary section; 2) a quality section; 3) an index section; 4) a checksum section; and/or 5) a data section.
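A bundle message carrying these five sections might be sketched as the following data structure; the field names are illustrative only and do not represent an on-wire format:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class BundleMessage:
        summary: dict                     # key identity: stream, bundle ID, row count, size
        quality: Optional[dict] = None    # light-weight source-supplied indicators
        index: Optional[dict] = None      # condensed content index built by the bundler
        checksum: Optional[dict] = None   # per-column aggregates for verification
        data: Optional[list] = None       # tabular payload in neutral form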

The summary segment of the bundle can be used to uniquely identify a bundle with key information, and also includes those top-level attributes of a bundle that are generally interesting across all types of data, such as the data stream that is bundled, the bundle series number, row counts of tabular data, and data segment size. In some embodiments, the summary segment may be sent independently of all other sections as a fast control signal.

The quality sections are optional, light-weight, reserved segments in bundle messages that sourcing processes may populate at their discretion. For example, a pricing process might supply indicators about the nature of the data it is producing, such as sanity check failures. By placing sanity check failures in a special area for consumption independently of data, sourcing can supply higher-velocity information to subscribers to this quality segment. Infrastructure teams might subscribe to such signals to know ahead of time of bad data as soon as it hits the bundler. MR analysts might also subscribe to certain signals to get early warning of sanity check failures. The size of these message segments will be physically limited, and their usefulness will depend on how well-organized the data is when provided.

The index segments are optional bundle segments that indicate the content of the bundle in a condensed form that is useful for tracking and problem-solving. Unlike quality segments, index segments may be built by the bundler using a business-aligned, policy-based key rule applied to tabular data. Index segments might be a fraction of the size of the data itself, but are expected to sometimes be larger than summary and quality segments.

The checksum message segment can be included in messages that contain the data segment. By including the checksum message segment, the data loads become independently verifiable in all distributed locations. In some embodiments, the checksum message segment includes a checksum representing an aggregation/sum/hash of all columns of the tabular data in the data section.

In some embodiments, the data segment of the bundle can be a tabular reformulation of a collection of messages or data. The data can be kept in a neutral form. The format of the data may be selected by balancing between keeping consistent with incoming message formats and being optimized for transformation into several other forms. Typically, the data segment can be any size up to the maximum size allowed for a bundle.

In some embodiments, the archive service 520 stores complete bundles that include all segments. By storing complete bundles, fault tolerance as well as back-population of new systems using replay can be provided. In addition, storing complete bundles also has an optimization benefit and impact for light-weight messaging, since control messages need not be durable; if all message segments are retained in an available archive store, then non-durable messaging strategies can simply be backed by archive polling to take advantage of more efficient distribution.

The summary segment provides the key information to uniquely identify a bundle and is what allows different messages about the same bundle to be associated. Non-key elements of the summary section provide some general characteristics that are useful for viewing and identifying a segment visually in monitoring tools (e.g., data count, size, data flow, and bundle number). Messages that have the data segment may also include the checksum segment. By including checksum segments with data segments, distributed independent locations can independently verify data quality post-transformation and upload without needing to cross-reference other locations. This independence is a key enabler of low-touch, reliable distribution of large amounts of data.

Using these constraints, any other message combinations can be created according to what is needed. A simple strategy is to create a lighter-weight message that includes everything except data as the control message. If segments are consistently light-weight, this strategy results in just two bundle types for a particular data stream—the control message including summary, quality, index, and checksum information, and a full data message that includes all of these plus data. The bundler could publish these two message types according to rules configured for dynamic message cargo.

As an example, the summary section may contain the following structure: the name of each value is shown before the “:=”, and the value can either be an enumerated list of exclusive items, a type of data, or another structure.

Summary := {
    SourceComponentType := { Bundler | Transformer | Loader }
    SourceComponentId := <arbitraryString>:<instanceNumber>
    BundleStream := {
        BusinessDate := <date>
        Type := { Greeks | CreditGreeks | UnderlierAttributes }
        Region := { GL | AS | EU | AM }
    }
    Size := <integer>
    ComponentFields := {
        bundleId := <integer>        // used by bundler only
        firstBundleId := <integer>   // used by transformer/loader
        lastBundleId := <integer>    // used by transformer/loader
    }
}

Additionally, values that need to be selected on or inspected prior to message parsing may be included in the Message properties section (limited to strings and numbers only). For example, to allow for filtering, a stringified version of the bundle stream will be added as a string message property. Dates, time stamps, and any data that must be parsed to interpret should only be included in the payload of the message (the JSON text).
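Purely as an illustration of the distinction between message properties and the JSON payload, a summary following the structure above might be built and serialized as follows; all values shown are hypothetical:

    import json

    summary = {
        "SourceComponentType": "Bundler",
        "SourceComponentId": "bundler-greeks:3",
        "BundleStream": {"BusinessDate": "2012-04-13", "Type": "Greeks", "Region": "EU"},
        "Size": 1048576,
        "ComponentFields": {"bundleId": 42},
    }

    # Stringified stream key exposed as a message property so it can be
    # filtered on before the JSON payload is parsed (strings and numbers only).
    properties = {"BundleStream": "2012-04-13|Greeks|EU"}
    payload = json.dumps({"Summary": summary})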

Data is bundled and tagged with a unique identifier by bundlers 510 and placed into archive 520. Transformers 530 pull data from archive 520 to convert it into consumer-friendly load data. Loaders 550 aligned with target reporting stores 560 ensure the data is loaded according to contract (serial and atomic bundles).

Archive 520 and repository services 540 manage state storage. The archive service can repeat all flows through the system. Achieving consistency in distributed data requires a solution to the ordering of events that affect data. Data distribution system 500 achieves consistency by distributing events as data and achieving order and integrity of that delivery to all consumers via the monotonically increasing unique identifiers. To put this in terms familiar to database administrators, data distribution system 500 manages the delivery of the transaction log components in an efficient and consistent manner to all databases and data stores of all types. The transaction log component stream is the flow. This is a reversal of the way technologists usually think about databases. The transaction log is normally considered to be backing the data in a database, not at the forefront of multiple databases.

To guarantee consistency under adverse circumstances, this stream of transaction data is persisted and made re-playable. This persisted, re-playable stream is the center of the pipeline state. It is the only state that needs to be carefully managed for recoverability. All components and state downstream of this stream are recoverable at whatever rate the stream can flow.

Reversing the positioning of the transaction log and database in a system can add complexity to applications built upon such a paradigm. Applications built on top of this type of delivery need to be mindful of version metadata if they need to compare data between different physical locations. One goal of the data distribution system is making this versioning paradigm as adoptable and simple as possible by establishing the simplest possible foundation for a distributed versioning scheme. At the heart of this versioning scheme is a new clocking methodology aimed at precise distribution.

Various embodiments of the present technology use monotonic consumer-aligned clocks and streams. One example of a clocking and ordering device is an increasing series of integers without gaps. Data distribution system 500 guarantees that for every designated data stream, there is one and only one contiguous series of integer ticks ordering the data (i.e., the Master Clock).

The most important aspect of the Master Clock is its alignment relative to the consumer. The ticks of the clock are aligned with consumer upload. Each tick represents a loadable set of data or “bundle”. For clients to get the benefits of consistency in the system, they must load each bundle in a transaction and in the order delivered by the clock. In other words, each bundle is an atomic upload for the client. This is made easier for clients by having a deterministic naming scheme with which to fetch any needed bundle and a deterministic series of bundles to be delivered (monotonically increasing order).
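A minimal sketch of a consumer honoring this contract, applying each bundle atomically and strictly in tick order, is shown below; load_atomically is a placeholder for the target store's upload and is not part of any described API:

    def apply_in_order(bundles, load_atomically, expected_id=1):
        """Apply bundles one at a time in gapless, monotonically increasing order."""
        for bundle in sorted(bundles, key=lambda b: b["bundle_id"]):
            if bundle["bundle_id"] != expected_id:
                raise RuntimeError(
                    f"gap in stream: expected {expected_id}, got {bundle['bundle_id']}")
            load_atomically(bundle)       # each bundle is one atomic upload
            expected_id += 1
        return expected_id                # next tick the consumer should see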

Data streams can be configured from any data series to establish independent data flow. What this technically means is that the stream will have its own independent series of clock ticks. This allows the stream to be run at a different rate from other streams. Any set of dimensions in a data set can be called out to declare a stream. Typically, these are aligned with different flow priorities. For example, if a data set consisting of client price valuations were divided into streams by business date and business, prices for a select business date and business could be allowed to flow at a higher priority than those of other businesses.

Some embodiments of the present technology allow for the ability to reference streaming data by allowing a contract to create referenced immutable archives immediately after consuming and writing down the data. The bundles that are created for distribution are forever identifiable through a well-defined reference strategy, which relies on their streaming package (bundle) identity. This approach, along with business-aligned archiving rules, bridges streaming and archive data almost as soon as data is written to disk. Every signal about the data or report related to the data can leverage a permanent reference to the package. Data lineage is given very immediate (near real-time) support in a streaming environment.

Distributed delivery to multiple targets could introduce a nightmarish reconciliation requirement. Various embodiments of data distribution system 500 include reconciliation in the flows through checksums. Checksums can be included in control messages with every data bundle. These checksums are used to ensure that data is as expected after delivery to the target store.

Data distribution system 500 can employ a columnar checksum strategy. This strategy is effective over row checksums in three ways. First, columnar checksums are in line with columnar compression optimizations. They can be much more efficient when compared with row checksums. Second, columnar checksums can work across different data stores and be given tolerance for rounding errors across platforms with different float representations. Third, columnar checksums are a more effective combination with columnar subscriptions, where clients subscribe to only a subset of columns of data.
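A sketch of a columnar checksum with a rounding tolerance for float columns might look like the following; the aggregation rule (a per-column sum over numeric columns) and the tolerance are illustrative assumptions, not a prescribed algorithm:

    import math

    def column_checksums(rows, columns):
        """Sum each numeric column independently; one checksum value per column."""
        return {c: sum(row[c] for row in rows) for c in columns}

    def verify(expected, actual, rel_tol=1e-9):
        """Compare per-column checksums, tolerating platform float differences."""
        return all(math.isclose(expected[c], actual[c], rel_tol=rel_tol)
                   for c in expected)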

FIG. 6 is a flowchart illustrating a set of operations 600 for delivering data in accordance with various embodiments of the present technology. As illustrated in FIG. 6, there can be two threads running within the data distribution system. One thread can allow the data packages received from producers to be archived and automatically pushed to any data subscribers. The second thread can allow data subscribers to request that some of the data be replayed.

Receiving operation 610 receives streaming data from one or more data producers. Bundling operation 620 can bundle the data into data packages that are assigned ordered unique identifiers. The data packages can then be archived using archiving operation 630. Transformation operation 640 transforms the data packages into a loadable format requested by the subscriber. The data packages can then be delivered to the data consumers in the desired format using delivery operation 650.

When replay operation 660 receives a request to replay some of the data packages, retrieval operation 670 can retrieve the desired data packages from the archive. These data packages are then transformed, using transformation operation 640, into the loadable format requested. The data packages can then be delivered in the desired format using delivery operation 650.

FIG. 7 illustrates an overview of a data distribution system architecture 700 which can be used in one or more embodiments of the present technology. FIG. 7 shows how message and control data are replicated exactly from the Archive to the DC2 Archive. There are two basic options for fault tolerance of the Archive: guaranteed local replication, and replication across partitions (e.g., to the DC2 archive). For MRT, the data flow is reproducible, and so local guarantees with 15 minute replication to the partition satisfy those needs. For transactional activity, higher degrees of guarantee of data replication across partitions may be warranted.

The archive service employs tiered storage to get maximum throughput. Tiered storage allows the archive to fulfill its role as a high volume demultiplexer (write once, read many). Data can be written to the archive service and, subsequently, almost immediately, read by multiple consumers. Various embodiments of the present technology include an API for consumers of bundles from the archive. The API can be very simple, but can make clear the responsibility of consumers of bundles (callers of the API—which are primarily the transformers) to declare what bundles they are asking for—which enforces the principle that consumers know the state of bundles to consume. This allows for clarifying one-way dependency as well as fault recovery responsibilities. For example, the API can be a single function such as the following: MsgArchive::GetBundles(data stream, start number, max bundles, max data size, max wait time).
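A transformer-side call against such an API might be sketched as follows; the archive.get_bundles method stands in for the single-function form quoted above, and the default limits are hypothetical:

    def next_batch(archive, data_stream, start_number,
                   max_bundles=100, max_data_size=64 * 2**20, max_wait_time=30):
        """Transformer-side fetch: the caller declares exactly which bundles it
        wants, starting at start_number; the archive keeps no consumer state."""
        return archive.get_bundles(data_stream, start_number, max_bundles,
                                   max_data_size, max_wait_time)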

Data consumers also can interact with the archive indirectly through events that are fired after an archive write. Publishing of all data in small bundles can allow various embodiments of the technology to function efficiently and flexibly when small messages requiring atomic handling and real-time load increase in frequency.

Transformers may only partially transform data (i.e., they are responsible for maintaining the basic schema, aggregation, and pivot of the structured data they consume from the archive service). The “transformation” is a conversion to an upload-friendly format. Each transformer may consume a consecutive sequence of bundles for a specific data stream, convert it into a form that is loadable, and upload it to the Transform Repository Service. Oftentimes, the product of a transformer can be consumed by multiple loader processes.

Transformers can be stateless and may make incremental progress on the data stream. As a result, the transformer may pick up state from its output (from the transform repository), and so emergency maintenance is possible by simply deleting or moving output in the repository; the transformer will continue work from the end of the latest batch in the repository to which it writes. Consumption follows the Subscription With Heartbeat pattern, which facilitates message push and pull according to dynamic message cargo configuration. The transformer can operate on a per-bundle-stream basis, allowing each stream to be generated in isolation and preventing one bundle stream from holding up another. In accordance with various embodiments, transformers may be owned by the producer/provider of data and are the sole consumers of raw data bundles, to ensure that external consumers do not directly consume bundle messages. Instead, external consumers consume the product of transformation according to a declared contract.

The Transform Repository Service is much like the Archive Service for bundles, but it stores the transformed product of the transformers. Transformed data packages can be significantly larger than bundles, consisting of multiple bundles in an upload package.

Loaders upload optimized packages from the Transform Repository Service into a designated target store. The loader configuration is provided at runtime and includes Transform Repository Service, target store, and data schema information. Much like the transformer/archive service relationship, loaders can subscribe to Transform Repository control signals and pick up what is needed from the Transform Repository Service. Also like the transformer, loaders leverage state in the targeted stateful store to ensure the bundles received are not repeated. This responsibility is made clear by the fact that the transformer API requires the loader to give the starting bundle ID of batches to fetch. Only the loader has insight into the load state in its target store and how it has progressed, so it makes sense for the loader to drive the requests for data. Note that this does not mean that the loaders are purely polling. Like transformers, loaders leverage the Subscription With Heartbeat approach to collecting data, which gives the best of push efficiency with the added monitoring reliability of heartbeats.

Loaders create an additional Load Control Stream into the target store for every data stream. This second stream is a store-specific load table history and checksum reconciliation result. It represents a particular data target's load record and checksum results. Note that the data stream is independent of this load record stream.

Various streams created by embodiments of the technology can be designed to facilitate high volume loading into a variety of reporting stores. For this reason, and because update-in-place is not an option for many types of stores, data streams are often best captured as tables of immutable events. When taking this approach, the data table becomes logically identical across all such reporting stores. It contains no instance-specific concepts such as time stamps. All instance-specific information is captured in the Load Control Table. One advantage of removing time stamps is the simplicity created in the clean separation of a data table that is logically identical across systems, while the Load Control Table is instance-specific. The unique identifiers can be used as a quasi-universal clock increment, so no time stamps are needed in a data table to understand relative versions. The Load Control Table maps instance-specific times to stream time for a particular data store.

Some embodiments support distributed transactions of any length or size and across any number of different data streams through a logical, rather than physical, approach to transactions. Transaction completeness is enforced at the destination, and timeout is configurable. Transactions are a type of logical data set with an added service for timeout. Distributed transactions can be achieved in one or more embodiments by flowing a special transaction condition on a separate channel from the data and assigning all data of the intended transaction a transaction ID. The transaction signal contains a transaction ID tag and a timeout threshold. Completion criteria consistent with a rule such as “RowCount” and a measure such as “2555” can be sent along with this initial signal if known, or sent at a later point if unknown. All data submitted with this identity is considered part of the transaction. The idea of logical transactions is to flow all transaction information to every target data store, even pending and uncommitted transaction data. The machinery of the transport processes pushes information between streams completely agnostic to transactions.

Data elements of a transaction may be assigned a unique transaction ID. For components like the bundler, transformer, and loader, this ID may be just another dimension of the data with no special purpose. The transaction ID can be assigned to any data elements across one or more streams without restriction. For sophisticated transaction handling of complex granular transactions, gaps between elements that are assigned the transaction ID are even allowable, and so any transaction across data elements can be specified.

Completion criteria for a transaction may be distributed to all end data stores via a separate transaction stream. Completion criteria can be as simple as the number of rows expected in every stream (or landing table) affected by the transaction, or as complex as rule-driven configuration allows. Completion criteria can include a time-out interval in some embodiments. From the bundler, archive, transformer, and loader perspectives, a transaction stream is a stream like any other within the data distribution system. There does not have to be any special handling that is different from any other high-speed, light-weight stream (like user edit streams). Completion criteria can arrive at any time before or after the data to which they refer.

As a consequence of this, and in order to elegantly support distributed transactions, transaction IDs for which there are no completion criteria may be considered pending transactions, and all data associated with them may be considered incomplete. In other words, the presence of a non-zero transaction ID is all that is needed to mark data as part of a transaction, and the absence of any supporting transaction information is interpreted exactly as a transaction that has not yet met its completion criteria.

A special Transaction Completion Process (or TCP) may run against a designated authoritative reporting instance and subscribe to all loader events that may affect the transaction streams that it handles. This process encapsulates all the logic for determining whether a transaction is complete based upon completion criteria and the data that has arrived. Note the implementation and configuration advantages of confining this complex logic to one process, and the advantages of keeping this complexity out of the reporting/query API. The TCP may hold a simple cache keyed by the transaction ID of all pending transactions. As data arrives, it would update transaction completion status in the End Data Store and remove the completed transaction from the cache. Like other robust system components, the TCP could leverage only the end data store state for start-up/recovery. Completion status can be updated by publishing to the Transaction Stream. This gives consistent transaction status to all distributed stores.

Completion status events may contain the transaction ID and the maximum bundle IDs of data that satisfies the transaction completion criteria. If a normalized table is used for transaction status with one row entry per data stream (a general, flexible, and probably best approach), completion criteria can be specified only for tables where data is expected, and the completion status bundle ID will be the maximum bundle ID of the data that completed the transaction. Similar to other pipeline components that process streams, the TCP publishes start and end handling for all transaction stream bundles (for consistency of monitoring, support, and extension). The TCP does not need to publish receiving signals from other data streams, however, because it is not transacting writes or commits in those streams. The TCP handles time-outs by marking the status of any transaction exceeding a given time limit as ‘Timed Out’ (and also removing the timed-out transaction from the cache of pending transactions).
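A sketch of the core completion check such a process might perform against its cache of pending transactions follows; it assumes a simple RowCount-style criterion, and the dictionary layout and field names are hypothetical:

    def record_data(pending, tran_id, row_count, max_bundle_id):
        """Track arriving rows for a transaction; report completion when the
        row-count criteria (if already known) are satisfied."""
        tran = pending.setdefault(tran_id, {"seen_rows": 0, "expected_rows": None})
        tran["seen_rows"] += row_count
        tran["max_bundle_id"] = max_bundle_id
        if tran["expected_rows"] is not None and tran["seen_rows"] >= tran["expected_rows"]:
            del pending[tran_id]      # completed: drop from the pending-transaction cache
            return {"tran_id": tran_id, "status": "Complete",
                    "max_bundle_id": max_bundle_id}
        return None                   # criteria unknown or not yet met: still pending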

Reporting queries leverage completion criteria in the Transaction Stream to query data when the Bundle ID is less than the Lowest Pending Tran Bundle ID. The simplest interpretation of transactions is to ignore any data bundles that are equal to or greater than the start of pending transactions where the Transaction Status is not equal to Timed Out, thereby ignoring timed-out transactions.

For more complex transaction handling, the TCP would be enhanced to check its cache of pending and progressing transactions for overlap on specified key spaces (combinations of dimensions) and would publish/write transaction failure for later transactions that collided with key spaces of other pending transactions.

Conveniently, the increasing complexity that is put into the TCP for granular transactions can reduce the complexity of queries. Queries of complex granular transactions need not select from the Lowest Pending Tran Bundle ID; they can simply select only where transactions have a successful complete status. Notice that by designing transaction completion handling into a single process, it becomes easier to configure different processing strategies for sets of streams as configuration rules when starting up a TCP. The sophistication of transaction processing can grow without complicating other core system components.

Because transaction assignment is orthogonal to bundles and completely flexible in how it can apply across data, transaction implementations can grow more sophisticated than with a bundle-dependent approach. Business-specific handling of what constitutes a collision can be encapsulated in the TCP, and query logic as a result can become even simpler than with less granular block-based selection.

Because processes within the system keep purely to their basic transportation responsibility, performance abnormalities and problems can be more easily isolated and understood. Load balancing and optimizations are not complicated by table-locking or any other type of inter-play or wait-for-completion in the flow. Since all transaction information is published to the end data store, it is reportable. It becomes a simple matter to investigate excessive numbers of collided transactions or time-outs.

Various embodiments of the technology keep primary components as purely a transportation infrastructure independent of business transactional requirements. As a result, these embodiments are able to reduce the future likelihood of processes that need to “peer into” the guts of the system to spy on pending transaction activity or other state. Application activity can instead leverage end data stores, which are better designed to support processes and tools.

Logical transactions work in conjunction with normal non-transactional flow via optimistic locking. The transactions are feeble (i.e., optimistic) compared to the data flow in that any flow that conflicts with a transaction will cause the transaction to fail, not the flow. It is possible to make transactions strong (i.e., locking) by creating logical flow failures using locking transactions that invalidate all data that arrives during a transaction and is not part of the transaction. However, backing up flow on the data distribution system (not physically, but through logical invalidation) seems against the spirit of guaranteed, eventually consistent distribution. Such a feature, if ever needed, would have to be used with great care.

Logical transactions take advantage of hardware trends where storage capacity has grown and continues to grow many-fold, but IO capabilities are lagging behind that growth. Capturing and storing all proposed transactions is cheap; the waste of space is not such a concern. As mentioned above, having failed transaction information available for some time is great for understanding and tuning the system.

There can be a considerable difference between schemas for managing flowing or changing data and schemas for reporting static data. Curation can be used in some embodiments for converting transaction-ridden distributed data into cleaner forms better suited to historical query.

Some embodiments can use a dynamic message cargo to dynamically optimize message flow for messages of variable size. All message bundles can still be stored in the Archive in complete form, including control, index, quality, and data segments. For large messages (the “large” threshold is configurable in the bundler), all but the data section may be published on the EMS bus, signaling successful archiving of the bundle. The transformers then reach back to request the data bundle from the archive. However, for large counts of small messages, we can publish the entire message including the data section, obviating the need for consumers to come back and request the data.

Transformers consuming dynamic messages then have a contract to respect the data content that is sent to them. For pure control messages, they reach back and grab the large data from the archive. For small messages, they can simply go ahead and consume the data in the message and forgo the extra trip (and spare the archive server from getting swamped with requests for small datasets). This dynamic strategy becomes very important as we ramp up different message flows with edits, blessing, and variable message contents. Flows that are suitable for push will push. Flows with very large datasets suitable for file-store-and-fetch will do that automatically without any adjustment to the various server processes. The system becomes highly tunable by adjustments to bundling sizes, message thresholds, and compression strategies.
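The dynamic message cargo decision might be sketched as follows, assuming a configurable size threshold in the bundler; the threshold value and function names are illustrative only:

    def publish_message(bundle, publish, large_threshold=4 * 1024 * 1024):
        """Publish control-only for large bundles; include data for small ones."""
        control = {k: bundle[k] for k in ("summary", "quality", "index", "checksum")
                   if k in bundle}
        if bundle["summary"]["Size"] > large_threshold:
            publish(control)                                 # consumers fetch data from the archive
        else:
            publish({**control, "data": bundle["data"]})     # data pushed with the message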

Packet size tuning can be left to operational teams that manage infrastructure. Infrastructure can be expected to change frequently and to be different from environment to environment. Tiered storage solutions will evolve, and optimal packaging for transport changes almost every time hardware is upgraded. This is why bundles are given a three-threshold configuration: by time threshold, by size threshold, and by logical row count threshold.

Subscription With Heartbeat is a combined approach of subscribing to non-durable messages while occasionally polling. The heartbeat (poll) alleviates durable store-and-forward responsibilities on producers and gives flow architects the ability to choose between durable and non-durable publication of control. However, the benefits of a heartbeat go beyond simple monitoring checks when the heartbeat is coupled with recovery logic; the combination gives a very simple and robust implementation of recovery.

As an example, consider the following transformer's consumption of bundles. A transformer starts up by asking the transform repository how far it has gone in processing, that is, what is the next bundle to process and store in the repository. It then calls the archive service for the next n bundles to process (whatever seems like a reasonable number). The archive service returns the bundles. After processing the batch, the transformer waits some time for a signal from the Archive Service indicating that a desired number of bundles are ready for the next batch.

If bundles are ready, the transformer requests them, repeating the first step. If no bundles are ready and there has been no incoming activity, the transformer will wait for the configured time interval before going and requesting bundles anyway. Remember, signals are not durable, so there is a small chance that no signal will arrive, but the transformer must go and ask to be certain. If there are still no bundles, then the time decay adds some amount to the wait for the next heartbeat check. The heartbeats keep slowing down. When bundles are finally available again, they are fetched, and the heartbeat rate is increased again for more frequent polling in the absence of control messages.
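
The recovery loop described above could be sketched as follows, assuming non-durable ready signals and a decaying poll rate. The repository, archive, and signal-waiting interfaces are illustrative, not an API of any described embodiment.

    def run_transformer(repo, archive, transform_fn, batch_size=100,
                        min_wait=1.0, max_wait=60.0, decay=2.0):
        # Sketch of subscription-with-heartbeat recovery: signals are not
        # durable, so the transformer always falls back to asking the
        # archive after a (slowly growing) wait.
        wait = min_wait
        next_bundle = repo.high_watermark() + 1
        while True:
            bundles = archive.get_bundles(start=next_bundle, count=batch_size)
            if bundles:
                for b in bundles:
                    repo.store(transform_fn(b))
                next_bundle = bundles[-1]["bundle_id"] + 1
                wait = min_wait                       # speed the heartbeat back up
            else:
                wait = min(wait * decay, max_wait)    # heartbeat keeps slowing down
            # Wake early if a ready signal arrives; poll anyway on timeout.
            archive.wait_for_signal(timeout=wait)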

Data targets bear some responsibility for allowing the data distribution system to ensure consistency. They must make data readable back to the data distribution system if the data distribution system is to manage consistency of the data created. However, this is normally a small cost relative to the gain from a reduced need for reconciliations between targets.

Data versioning can be used for getting consistent data across different target data stores. For the data distribution system clients, the version strategy requires a shift away from traditional ACID approaches toward an approach where every request for data returns not only the data, but also the stream clock information. Queries that need to be consistent across different destinations need to provide the stream clock time (the bundle ID) in order to get the same results, or fail/warn if data is not yet available. However, the data distribution system's contract is that the data will be consistent eventually.
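
As a hedged illustration of this versioned read contract, the sketch below coordinates a query across several target stores using a shared bundle ID; the store interface and the behavior of failing rather than warning are assumptions made for the example.

    def consistent_read(stores, query, as_of_bundle_id):
        # Each store returns (rows, latest_bundle_id). A read is consistent
        # across stores only when every store has reached the requested
        # stream-clock position; otherwise fail (or warn) and retry later.
        results = {}
        for name, store in stores.items():
            rows, latest = store.query(query, max_bundle_id=as_of_bundle_id)
            if latest < as_of_bundle_id:
                raise RuntimeError(
                    f"{name} has only reached bundle {latest}; "
                    f"data is not yet available as of {as_of_bundle_id}")
            results[name] = rows
        return results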

Updates of data are essentially just replacements of the same data, where "same" is determined by an identity strategy. Identity strategies are basically a selection of key attributes by which to rank/order and select. Later data arriving in a stream with the same identity replaces earlier data. However, data is not always controlled, replaced, or invalidated by identity. In some cases, it makes sense to manage data by attributes outside of identity. Logical data sets, explained below, are another way to manage collections of data not by identity, but by other attributes such as by whom the data was submitted or for what purpose.
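
A minimal sketch of identity-based replacement, last writer wins, is given below. The particular key attributes shown are hypothetical examples; an identity strategy is simply whatever selection of key attributes is chosen.

    def apply_stream(rows, identity_attrs=("instrument_id", "as_of_date")):
        # Later data of the same identity replaces earlier data; the identity
        # strategy is just a choice of key attributes (names here are examples).
        current = {}
        for row in rows:                       # rows arrive in stream order
            key = tuple(row[a] for a in identity_attrs)
            current[key] = row                 # last writer wins
        return list(current.values())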

The identity service is another example of how a data flow with embedded declarative structure can have great advantages over unstructured flows. The identity service can generate short integer identity keys in the flow for a subset of chosen columns and enrich the dataset using these keys. Such keys serve as a performance boost for report operations in any target store. This is especially effective when modeling broad, denormalized streams, which is often a practice for distributed sourcing. Currently, the plan is to use daily-generated key sets because it is only within the daily flow that the boost is needed. Historical key generation does not need to be done within the flow.
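
The following sketch shows one way such key enrichment could work, assuming an in-memory, daily-generated key set; the class name, column handling, and the surrogate_key field are illustrative.

    class IdentityService:
        """Illustrative sketch: issue short integer keys for a chosen subset
        of columns and enrich each row with the key (e.g., a daily key set)."""

        def __init__(self, key_columns):
            self.key_columns = key_columns
            self.keys = {}

        def enrich(self, row):
            value = tuple(row[c] for c in self.key_columns)
            # Assign the next small integer the first time a value is seen.
            key = self.keys.setdefault(value, len(self.keys) + 1)
            enriched = dict(row)
            enriched["surrogate_key"] = key   # small int speeds up report joins
            return enriched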

The load control table is the way to ensure that the data for a particular series is consistent with respect to checksums. Clients that want absolute assurance that they select only checked-out, reconciled data need to use the load control table to determine the highest watermark that they can safely select. For example: Select from Dataset where BundleID < (Select MIN(InvalidBundleStatus) from DatasetControl) . . . , where InvalidBundleStatus is determined by finding the lowest bundle ID that has not yet been successfully validated, or by returning the MAX bundle ID + 1. Such an algorithm treats all pending validations as invalid until they run.
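
The watermark query above could be expressed concretely as in the sketch below. The Dataset and DatasetControl tables and the BundleID and InvalidBundleStatus columns come from the example; the column list (SELECT *) and the connection handling are assumptions.

    WATERMARK_QUERY = """
        SELECT *
        FROM Dataset
        WHERE BundleID < (SELECT MIN(InvalidBundleStatus)
                          FROM DatasetControl)
    """

    def select_reconciled(conn):
        # conn is any DB-API 2.0 connection; pending validations are
        # treated as invalid, so only data below the safe watermark returns.
        cur = conn.cursor()
        cur.execute(WATERMARK_QUERY)
        return cur.fetchall()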

While this adds some complexity overhead for clients that care about bundle check status, the leverage gained from bundle size relative to individual data items (typically on the order of 1000 to 1) means no noticeable performance overhead. One advantage of such a system is that it puts the criticality of a checksum failure in the hands of the client. It can very well be the case in very large data flows that a client still wants access to data, even knowing a checksum failed. The client may very well decide, based on the immateriality of the data, to use subsequent data while the bad data is being repaired.

Logical data sets are an extension of the idea used for creating transactions. Since bundles are a physical packet domain owned completely by infrastructure operations for tuning flow, a separate logical construct must be introduced to manage any type of application data set. A logical data set is created by sending out a logical data set definition with a unique identity and then tagging all subsequently submitted data with that identity. The logical data set definition is published on a separate stream with a separate structure from the data stream(s) to which it refers. Clients can then select properties of the logical set stream and select out the data to which it applies.
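
One way this definition-then-tagging pattern could look is sketched below; the field names, the uuid-based identity, and the expected_rows completeness hint are assumptions for illustration.

    import uuid

    def define_logical_set(set_stream, purpose, submitted_by, expected_rows=None):
        # Publish the definition on its own stream, with its own structure.
        set_id = str(uuid.uuid4())
        set_stream.publish({
            "logical_set_id": set_id,
            "purpose": purpose,
            "submitted_by": submitted_by,
            "expected_rows": expected_rows,   # optional completeness criterion
        })
        return set_id

    def tag(rows, set_id):
        # Tag all subsequently submitted data with the set identity.
        return [dict(row, logical_set_id=set_id) for row in rows]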

Logical data sets help facilitate completeness functionality. Similar to transactions, completeness criteria supplied in the logical data stream and tagged against relevant data can indicate whether all data within a logical set is available in a particular reporting store.

Transactions are one type of completeness logical data set with an added factor of timeout checking. Since logical sets are represented as streams, they enjoy the same bundle and versioning features as regular data streams. This means that logical set information can evolve just as data in data streams does. And so a series of logical events is queued up in a stream, and there is a logical clock against which to align all of these logical events across systems. A good strategy is to allow logical set events to show up as included or excluded from default general consumption. The logical event stream can, in this way, form a common perspective on the data for all consumers, while consumers may choose to include or ignore certain components in their own preview, and then push out status updates to the logical set stream to make desired aspects public. The official common view of data and events can then be communicated as stream/max bundle ID pairings for both data sets and logical data sets. This is perhaps the most concise and intuitive way to express such a selection of complex streamed events.

The publication of a set of approved pairings of streams and bundle IDs is known as a "Blessing". Blessings can be published by any stream authority that has determined that the data stream, as of a particular point, is good for consumption. Blessings of parallel-delivered streams provide parallel availability of large data sets and coordination/consistency in using those data sets.
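
A blessing, being just a set of stream/max bundle ID pairings, could be represented as in the following sketch; the message shape, the control bus interface, and the coverage check are assumptions, not a prescribed format.

    def publish_blessing(control_bus, pairings, authority):
        # A blessing is a set of (stream, max bundle ID) pairings approved
        # by a stream authority as good for consumption.
        control_bus.publish({
            "type": "blessing",
            "authority": authority,
            "pairings": dict(pairings),   # e.g. {"positions": 1041, "prices": 2210}
        })

    def covered_by(blessing, loaded_watermarks):
        # A consumer may act on blessed data once every blessed stream has
        # been loaded at least as far as the blessed bundle ID.
        return all(loaded_watermarks.get(stream, -1) >= bundle_id
                   for stream, bundle_id in blessing["pairings"].items())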

Edits within some embodiments are another form of logical data set. Data that is being modified is simply added to a stream to replace the prior versions of that data, with a unique edit identity tying it to all other data in the edit. In this way, an edit can be included or ignored. Deletes are a special form of modification. A delete may be implemented as an invalidation of a particular logical set. For example, a bad sourcing run could be "deleted" by ignoring the logical set associated with that sourcing run. Deletes of individual items, independent of invalidating any particular logical set, would be done by submitting a data item with the same identity as the item to be deleted and a logical delete status set to true in the data table.
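
Both delete styles described above could be expressed as shown in this short sketch; the stream interfaces, field names, and the "invalidated" status value are illustrative assumptions.

    def delete_item(data_stream, identity_row):
        # Individual delete: resubmit the same identity with the logical
        # delete flag set to true in the data table.
        data_stream.publish(dict(identity_row, logical_delete=True))

    def delete_logical_set(set_stream, set_id):
        # "Delete" of a bad sourcing run: invalidate (ignore) its logical set.
        set_stream.publish({"logical_set_id": set_id, "status": "invalidated"})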

Fault tolerance can be achieved through dual flows off of a replicated message archive. Given that all data can be regenerated off of the archive, the only critical point of consideration is how to create fault tolerance of the archive itself.

Exemplary Computer System Overview

Embodiments of the present technology include various steps and operations, which have been described above. A variety of these steps and operations may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware, software, and/or firmware. As such, FIG. 8 is an example of a computer system 800 with which embodiments of the present technology may be utilized. According to the present example, the computer system includes a bus 810, at least one processor 820, at least one communication port 830, a main memory 840, a removable storage media 850, a read only memory 860, and a mass storage 870.

Processor(s) 820 can be any known processor, such as, but not limited to, Intel® processors, AMD® processors, ARM-based processors, or Motorola® lines of processors. Communication port(s) 830 can be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, or a Gigabit port using copper or fiber. Communication port(s) 830 may be chosen depending on a network such as a Local Area Network (LAN), a Wide Area Network (WAN), or any network to which the computer system 800 connects.

Main memory 840 can be Random Access Memory (RAM) or any other dynamic storage device(s) commonly known in the art. Read only memory 860 can be any static storage device(s), such as Programmable Read Only Memory (PROM) chips, for storing static information such as instructions for processor 820.

Mass storage 870 can be used to store information and instructions. For example, hard disks such as the Adaptec® family of SCSI drives, an optical disc, an array of disks such as a RAID array (for example, the Adaptec® family of RAID drives), or any other mass storage devices may be used.

Bus 810 communicatively couples processor(s) 820 with the other memory, storage, and communication blocks. Bus 810 can be a PCI/PCI-X or SCSI based system bus depending on the storage devices used.

Removable storage media 850 can be any kind of external hard drive, floppy drive, IOMEGA® Zip Drive, Compact Disc-Read Only Memory (CD-ROM), Compact Disc-Re-Writable (CD-RW), and/or Digital Video Disc-Read Only Memory (DVD-ROM).

The components described above are meant to exemplify some types of possibilities. In no way should the aforementioned examples limit the scope of the application, as they are only exemplary embodiments.

In conclusion, the technology of the present application provides novel systems, methods, and arrangements for structured data distribution. While detailed descriptions of one or more embodiments of the technology have been given above, various alternatives, modifications, and equivalents will be apparent to those skilled in the art without varying from the spirit of the application. For example, while the embodiments described above refer to particular features, the scope of this application also includes embodiments having different combinations of features and embodiments that do not include all of the described features.

What is claimed is:
1. A method comprising: receiving streaming data from a data producer; determining business aligned archive sequence from the streaming data that should be bundled together in accordance with a set of bundling parameters; bundling the data into packages of data having a standard format; ordering each of the packages of data using a series of consecutive integers produced by a master clock; publishing metadata regarding availability of the packages of data on a control channel; and delivering the packages of data to data consumers which have subscribed to the data producer.
2. The method of claim 1, wherein the bundling parameters include declarative rules specified by a business.
3. The method of claim 1, wherein the standard format is a standard neutral format biased for movement and compression of the streaming data.
4. The method of claim 1, further comprising replaying the packages of data based on the ordering upon a request from a data consumer.
5. The method of claim 1, further comprising archiving the packages of data in a platform independent manner.
6. The method of claim 1, wherein the metadata that is published on the control channel includes indexes.
7. The method of claim 1, wherein bundling the data includes identifying data that when compressed will result in each package in the packages of data having a desired size.
8. The method of claim 1, wherein bundling the data includes associating new metadata with each of the bundled packages, wherein the new metadata comprises at least one of summary data, quality data, index data, or checksum data.
9. The method of claim 1, wherein the delivery of the packages of data to the consumers comprises parallel delivery.
10. The method of claim 1, further comprising using columnar checksums for verifying the data.
11. The method of claim 10, wherein the columnar checksums allow for rounding errors with a specified tolerance.
12. A system comprising: a bundler configured to receive streaming raw data from a data producer and bundle the raw data into a series of data packages and associate with each of the data packages a unique identifier having a monotonically increasing order based on upload from the data producer; a transformer to receive the data packages having the associated unique identifier and generate loadable data structures for a reporting store associated with a data subscriber; and a loader to receive and store the loadable data structures into a storage device associated with the data subscriber based on the monotonically increasing order.
13. The system of claim 12, wherein the streaming raw data comprises multiple streams.
14. The system of claim 13, wherein the data packages from each of the multiple streams are assigned different sets of unique identifiers.
15. The system of claim 13, wherein each of the multiple streams of streaming raw data are assigned a flow priority.
16. The system of claim 12, further comprising an identification module to receive a logical series of integers from the stream clock and generate the unique identifier having the logical ordering.
17. The system of claim 16, further comprising a stream clock configured to generate the logical series of integers, and wherein a single integer is the unique identifier associated with a single data package in the series of data packages.
18. The system of claim 12, further comprising: a data channel allowing data from a data producer to be continuously streamed to the data subscriber through the bundler; a messaging channel to provide a current status of the data being continuously streamed from the data producer to the data subscriber; and a control channel separate from the data channel to allow the data subscriber to request replay of the data.
19. The system of claim 18, wherein the control channel is running at a faster rate than the data channel.
20. The system of claim 18, wherein the control channel recursively publishes metadata regarding the data packages.
21. The system of claim 12, further comprising an archiving service to archive the data packages.
22. A method comprising: receiving a request to replay data bundled into data packages having a logical ordering assigned to the data packages before being stored in an archive, wherein the request includes a logical bound on the data to be replayed and identifies a format for the data subscriber; retrieving, from the archive, data consistent with the logical bound; and transforming the data packages into the loadable format identified in the request to replay the data.
23. The method of claim 22, wherein data packages after the logical bound are ignored.
24. The method of claim 22, further comprising: receiving a selection of an archiving strategy; and archiving the data in accordance with the archiving strategy.
25. The method of claim 24, wherein the archiving strategy is a dimension-based archiving strategy.
26. The method of claim 22, wherein the data packages are compressed using columnar compression.
27. The method of claim 22, wherein the data packages include metadata with each providing summary data, quality data, index data, or checksum data.